[TMA] Update TMAStoreWaitOp to wait for the memory write to complete #10415

Merged: peterbell10 merged 1 commit into main from pb/tma-store-wait on May 29, 2026
@peterbell10

Contributor

Currently `desc.store(...)` does not guarantee that the write has completed to global memory, which makes message passing impossible.

e.g.
```
desc.store(...)
tl.atomic_xchg(flag, 1, sem="release")
```

does not release the store.

To maintain perf in gluon, I also expose the read-only TMA wait variant so users can explicitly opt in to the behavior.
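
The hazard can be sketched with a toy, deterministic Python model. This is an illustration only: `ToyTmaModel`, `store`, `wait`, and `release_flag` are hypothetical names, not Triton or Gluon APIs, and the model freezes the worst case (the asynchronous write never lands unless a full wait is issued), which real hardware would not do indefinitely. It assumes a read-only wait only frees the staging buffer, while a full wait also means the data is in global memory.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTmaModel:
    """Toy model: async stores sit in a queue until a full wait drains them."""
    global_mem: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)  # stores issued but not yet written
    staging_busy: bool = False                   # staging buffer still being read

    def store(self, addr, value):
        # The store is only issued here; global memory is not updated yet.
        self.pending.append((addr, value))
        self.staging_busy = True

    def wait(self, read_only):
        # Either kind of wait means the staging buffer may be reused.
        self.staging_busy = False
        if not read_only:
            # Assumption: a full wait also means the writes have landed.
            for addr, value in self.pending:
                self.global_mem[addr] = value
            self.pending.clear()

    def release_flag(self, flag_addr):
        # Models the release atomic on the flag. In this toy it does NOT
        # flush pending async stores, which is the hazard described above.
        self.global_mem[flag_addr] = 1


def consumer_sees(model, data_addr, flag_addr):
    # A consumer that observes the flag and then reads the data.
    if model.global_mem.get(flag_addr) == 1:
        return model.global_mem.get(data_addr)
    return None  # flag not observed yet


def run(read_only):
    model = ToyTmaModel(global_mem={"data": 0})  # 0 is the stale value
    model.store("data", 42)
    model.wait(read_only=read_only)
    model.release_flag("flag")
    return consumer_sees(model, "data", "flag")
```

In this model, `run(read_only=True)` lets the consumer observe the flag while the data is still stale, whereas `run(read_only=False)` makes the new value visible before the flag is published.
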
@peterbell10 peterbell10 requested a review from ThomasRaoux May 29, 2026 17:24
@peterbell10 peterbell10 requested a review from ptillet as a code owner May 29, 2026 17:24

@ThomasRaoux ThomasRaoux left a comment

Collaborator

LGTM

@ThomasRaoux ThomasRaoux left a comment

Collaborator

Actually, thinking more about this: if it is true, it means something is broken in the acquire/release semantics of PTX ops. Are you sure the example you have doesn't work?

@peterbell10

Contributor Author

@ThomasRaoux

Collaborator

https://triton-lang.slack.com/archives/C04CZ1MCL65/p1780084322002219

sad, anyway makes sense, hopefully it doesn't affect perf significantly

@peterbell10 peterbell10 merged commit 02480ad into main May 29, 2026
10 checks passed
@peterbell10 peterbell10 deleted the pb/tma-store-wait branch May 29, 2026 20:18
ThomasRaoux added a commit that referenced this pull request Jun 2, 2026
Use read-only waits for pipelined TMA stores so the in-loop wait only
protects shared-memory staging buffer reuse.

Skip TMA store pipelining when the loop contains acquire or release
atomics so those memory-ordering cases keep the non-pipelined descriptor
store lowering.

This recovers part of the performance regression caused by
[#10415](#10415)
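
The decision described in this commit message can be sketched in Python as follows. This is a hedged illustration, not the compiler's actual pass: `Op`, `plan_tma_store_lowering`, and the returned labels are hypothetical, and treating `acq_rel` as an ordering atomic alongside `acquire` and `release` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: acq_rel counts as both acquire and release for this check.
ORDERING_SEMS = {"acquire", "release", "acq_rel"}


@dataclass(frozen=True)
class Op:
    kind: str                  # hypothetical tags: "tma_store", "atomic", "other"
    sem: Optional[str] = None  # memory semantic, only meaningful for atomics


def plan_tma_store_lowering(loop_ops):
    """Pick a lowering strategy for TMA stores in a loop body."""
    if not any(op.kind == "tma_store" for op in loop_ops):
        return "nothing_to_do"
    has_ordering_atomic = any(
        op.kind == "atomic" and op.sem in ORDERING_SEMS for op in loop_ops
    )
    if has_ordering_atomic:
        # Keep the non-pipelined store so the write completes before the atomic.
        return "no_pipelining_full_wait"
    # The in-loop wait only protects reuse of the shared-memory staging buffer.
    return "pipelined_read_only_wait"
```

For example, a loop containing a TMA store and a `relaxed` atomic would still be pipelined under this sketch, while one containing a `release` atomic would not.
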